{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1><center>Decision Tree</center></h1>\n",
    "# 1.Introduction\n",
    "Tree-based methods partition the feature space into a set of rectangles, and then fit a simple model (like a constant) in each one. They are conceptually simple yet powerful and easily interpreted. Assuming that we have partition the feature space into M regions as $R_1,R_2,\\cdots,R_M$,the corresponding regression model predicts the target y with a constant $c_m$ in region $R_m$, that is\n",
    "$$\n",
    "\\hat{f}(\\vec{x})=\\sum_{m=1}^Mc_m\\mathbb{1}\\{\\vec{x}\\in R_m\\}\n",
    "$$\n",
    "<img src=\"imgs/1.png\" alt=\"drawing\" width=\"300\"/>\n",
    "# 2. Regression Trees\n",
    "We now turn to the question of how to grow a regression tree with training data $\\{\\vec{x}_i,y\\}_{i=1}^N$. The algorithm needs to decide on the splitting features and splitting points, and also what topology the tree should have. Suppose first we have a partition into $M$ regions $R_1,R_2,\\cdots,R_M$, and we model the response as a constant $c_m$ in each region\n",
    "$$\n",
    "\\hat{f}(\\vec{x})=\\sum_{m=1}^Mc_m\\mathbb{1}\\{\\vec{x}\\in R_m\\}\n",
    "$$\n",
    "It is easy to see that the best $\\hat{c}_m$ is just the average of $y_i$ in region $R_m$\n",
    "$$\n",
    "\\hat{c}_m=\\sum_{i=1}^N y_i \\mathbb{1}(\\vec{x}_i \\in R_m)\n",
    "$$\n",
    "However, finding the best binary partition in terms of minimum sum of squares loss is generally infeasible, which is a NP-hard problem. We need to proceed with a greedy algorithm.\n",
    "\n",
    "Starting with all of the data in the tree root, consider a splitting feature $j$ and split point $s$, and define the pair of half-planes\n",
    "$$\n",
    "R_1(j,s)=\\{\\vec{x}|x_j\\leq s\\} \\quad \\text{and}\\quad R_2(j,s)=\\{\\vec{x}|x_j>s\\}\n",
    "$$\n",
    "Then we seek the splitting variable $j$ and split point $s$ that solve\n",
    "$$\n",
    "\\underset{j,s}{min}\\left[\\underset{c_1}{min}\\sum_{\\vec{x}_i \\in R_1(j,s)}(y_i-c_1)^2+\\underset{c_2}{min}\\sum_{\\vec{x}_i \\in R_2(j,s)}(y_i-c_2)^2\\right]\n",
    "$$\n",
    "For any chioce $j$ and $s$, the inner minimization is solved by\n",
    "\\begin{align}\n",
    "\\hat{c}_1=ave(y_i|\\vec{x}_i \\in R_1(j,s)) \\\\\n",
    "\\hat{c}_2=ave(y_i|\\vec{x}_i \\in R_2(j,s))\n",
    "\\end{align}\n",
    "For each splitting variable, the determination of the split point s can be done by scanning through all of the inputs. Then the determination of the best pair $(j,s)$ is feasible.\n",
    "\n",
    "Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of the two regions. Then this process is repeated on all of the resulting regions. We should decide how large should we grow the tree. Clearly a very large tree might overfit the data, which a small tree might not capture the important structure.Tree size is a tuning parameter governing the model’s complexity, and the optimal tree size should be adaptively chosen from the data. One approach would be to split tree nodes only if the decrease in sum-of-squares due to the split exceeds some threshold. This strategy is too short-sighted, however,since a seemingly worthless split might lead to a very good split below it.\n",
    "\n",
    "The preferred strategy is to grow a large tree $T_0$, stopping the splitting process only when some minimum node size is reached. Then this large tree is pruned using *cost-complexity pruning*. We define the cost complexity criterion\n",
    "$$\n",
    "C_{\\alpha}(T)=\\sum_{m=1}^{T}\\left[\\sum_{\\vec{x}_i\\in R_m}(y_i-\\hat{c}_m)^2\\right]+\\alpha T\n",
    "$$\n",
    "The tuning parameter $\\alpha\\geq 0$ governs the tradeoff between tree complexity and its fit performance.\n",
    "# 3. Classification Trees\n",
    "If the target is a classification outcome taking values $1,2,\\cdots,K$, the only changes needed in the tree algorithm pertain to the criteria for splitting nodes and pruning the tree. For regression we used the squared-error, which is not suitable for classification. In a node m, representing a region $R_m$ with $N_m$ observations, let\n",
    "$$\n",
    "\\hat{p}_{mk}=\\frac{\\sum_{\\vec{x}_i \\in R_m} \\mathbb{1}(y_i=k)}{N_m}\n",
    "$$\n",
    "the proportion of class $k$ observations in node m. We classify the observations in node m to class $k(m)=argmax_k\\hat{p}_{mk}$, the majority class in node m. Different meansures node impurity include the following\n",
    "* Misclassification error: $1-\\hat{p}_{m\\,k(m)}$\n",
    "* Gini index: $\\sum_{k=1}^K\\hat{p}_{mk}(1-\\hat{p}_{mk})$\n",
    "* Entropy: $-\\sum_{k=1}^K \\hat{p}_{mk}log\\hat{p}_{mk}$\n",
    "\n",
    "We need to weight the node impurity measures by the number $N_{m_L}$ and $N_{m_R}$ of observations in the two child nodes created by splitting node $m$.\n",
    "# 4. Other issues\n",
    "## 4.1 Categorical features\n",
    "When splitting a categorical feature having $q$ possible unordered values, there are $2^{q-1}-1$ possible partitions of the $q$ values into two groups, and the computations become prohibitive for large $q$. However, in a binary classification problem, this computation simplifies. We order the categorical feature according to the proportion falling in outcome class 1. Then we split this categorical feature as if it were an ordered feature. One can show this gives the optimal split, in terms of entropy or Gini index ,among all possible $2^{q-1}-1$ splits. The proof is given by Breiman (1984). This result also holds for a quantitative outcome and square error loss—the categorical feature are ordered by increasing mean of the outcome.The proof for quantitative outcomes can be found in Fisher(1958). LightGBM has used this trick for categorical features for accuracy and speed consideration. Also, this trick can is identity to [target encoding]( http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/target-encoding.html)\n",
    "## 4.2 Missing  values\n",
    "Suppose our data has some missing values in some observations. For tree-based models, there are two approaches. The first is applicable to categorical predictors: we simply make a new category for “missing.” From this we might discover that observations with missing values for some measurement behave differently than those with nonmissing values. The second more general approach is the construction of surrogate variables. When considering a feature for a split, we use only the observations for which that feature is not missing. Having chosen the best feature and split point,we form a list of surrogate features and split points.The first surrogate is the feature and corresponding split point that best mimics the split of the training data achieved by the primary split. The second surrogate is the feature and corresponding split point that does second best, and so on. When sending observations down the tree either in the training phase or during prediction, we use the surrogate splits in order, if the primary splitting predictor is missing. Xgboost has implement a similar approach which you can refer to [this paper](https://arxiv.org/abs/1603.02754)\n",
    "## 4.3 Problem with Tree\n",
    "One major problem with trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. The major reason for this instability is the hierarchical nature of the process: the effect of an error in the top split is propagated down to all of the splits below it. Another limitation of trees is the lack of smoothness of the prediction surface. Therefore tree-based models do not predict very accurately compared to other kinds of model."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
